Synchronous sublattice algorithm for parallel kinetic Monte Carlo 
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The standard kinetic Monte Carlo algorithm is an extremely efficient method to carry out serial 
simulations of dynamical processes such as thin-film growth. However, in some cases it is necessary 
to study systems over extended time and length scales, and therefore a parallel algorithm is desired. 
Here we describe an efficient, semi- rigorous synchronous sublattice algorithm for parallel kinetic 
Monte Carlo simulations. The accuracy and parallel efficiency are studied as a function of diffusion 
rate, processor size, and number of processors for a variety of simple models of epitaxial growth. The 
effects of fluctuations on the parallel efficiency are also studied. Since only local communications 
are required, linear scaling behavior is observed, e.g. the parallel efficiency is independent of the 
number of processors for fixed processor size. 



PACS numbers: 81.15.Aa, 05.10.Ln, 05.10.-a, 89.20.Ff 



I. INTRODUCTION 

Kinetic Monte Carlo (KMC) is an extremely efficient 
methooji*2iiM£ to carry out dynamical simulations of 
stochastic and/or thermally activated processes when 
the relevant activated atomic-scale processes are known. 
KMC simulations have been successfully used to model 
a variety of dynamical processes ranging from catalysis 
to thin-film growth. The basic principle of kinetic Monte 
Carlo is that in order to efficiently simulate a dynamical 
system with a variety of different rates or processes, at 
each step in the simulation one picks the next process to 
occur with a probability proportional to the rate for that 
process. The time of the next event is determined by the 
total overall rate for all processes to occur, and after each 
event the rates for all processes are updated as necessary. 

In contrast to Metropolis Monte Carlop in which 
each Monte Carlo step corresponds to a configuration- 
independent time interval and each event is selected 
randomly but only accepted with a configuration- 
dependent probability, in kinetic Monte Carlo both the 
selected event and the time interval between events are 
configuration-dependent while the acceptance probabil- 
ity is fixed (all attempts are accepted). In the context 
of traditional equilibrium Monte Carlo simulations this 
is sometimes referred to as the n-fold wayi Although 
KMC requires additional book-keeping to keep track of 
the rates (probabilities) for all possible events, the KMC 
algorithm is typically significantly more efficient than 
the Metropolis algorithm since no selected moves are 
rejected. In particular, for problems such as thin-film 
growth in which the possible rates or probabilities for 
events can vary by several orders of magnitude, the ki- 
netic Monte Carlo algorithm can be orders of magnitude 
more efficient than Metropolis Monte Carlo. 

The standard KMC algorithm is a serial algorithm 
since only one event can occur at each step. However, 
for some problems one needs to simulate larger length 
and time-scales than can be simulated using a serial al- 
gorithm. For these problems it would be desirable to 



develop efficient parallel kinetic Monte Carlo algorithms 
so that many processors can be used simultaneously in 
order to carry out realistic computations over extended 
time- and length-scales. 

Recently there has been a great deal of work on asyn- 
chronous parallel algorithms for Metropolis Monte Carlo 
using domain decomposition. In particular, because the 
attempt time in Metropolis Monte Carlo is independent 
of system configuration, an asynchronous "conservative" 



algorithm may be used 
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In such an algorithm 



all processors whose next attempt time is less than their 
neighbor's next attempt times are allowed to proceed. 
Unfortunately such a "conservative" algorithm does not 
work for kinetic Monte Carlo since in KMC the event- 
time depends on the system configuration. In particular, 
since fast events may "propagate" across processors, the 
time for an event already executed by a processor may 
change due to earlier events in nearby processors, thus 
leading to an incorrect evolution. As a result, the devel- 
opment of efficient parallel algorithms for kinetic Monte 
Carlo simulations remains a challenging problem. 

A more efficient version of the conservative asyn- 
chronous algorithm for Metropolis Monte Carlo has been 
developed by Lubachevskj^ in the context of Ising simu- 
lations and has been implemented by Korniss et alJ^ In 
this approach, "n-fold way'— simulations are carried out 
in the interior of each processor, while Metropolis sim- 
ulations are carried out at the boundary. At each step, 
either an interior move or a boundary move is selected 
with the appropriate probability. While all "n-fold way" 
interior moves are immediately accepted, all Metropo- 
lis attempts must wait until the neighboring processor's 
next attempt time is later before being either accepted 
or rejected. Since such an algorithm is equivalent to the 
conservative Metropolis Monte Carlo algorithm described 
above, it is generally scalablej2*i2*ii and has been found 
to be relatively efficient in the context of kinetic Ising 
model simulations in the metastable regimeiiSii&ii 

Such an approach can be generalized^ in order to 
carry out parallel KMC simulations by mapping all KMC 



moves to a Metropolis move with an acceptance probabil- 
ity given by the rate for that event divided by the fastest 
possible rate in the KMC simulation. However, because 
of the possibility of significant rejection of boundary 
events, the parallel efficiency may be very low for prob- 
lems with a wide range of rates for different processes. 
For example, we have recently^ used such a mapping to 
carry out parallel KMC simulations of a simple 2D solid- 
on-solid "fractal" model of submonolayer growth with a 
moderate monomer ratio D/F = 10 5 of monomer hop- 
ping rate D to (per site) deposition rate F. However, 
due to the rejection of boundary events, an extremely 
low parallel efficiency was obtained^ Furthermore, in 
order to use such an approach, in general one needs to 
know in advance all the possible events and their rates 
and then to map them to Metropolis dynamics so that 
all events may be selected with the appropriate probabil- 
ities. While such a mapping may be carried out for the 
simplest models, for more complicated models it is likely 
to be prohibitive. 

A more efficient algorithm, which is also rigorous, is 
the synchronous relaxation (SR) algorithmiiS^i This al- 
gorithm was originally used by Eick et ali& to simulate 
large circuit-switched communication networks and more 
recently by Lubachevsky and Weissii in the context of 
fsing model simulations. In this approach, all processors 
remain synchronized at the beginning and end of a time 
interval T, while an iterative relaxation method is used 
to correct errors due to boundary events. This algorithm 
has the advantage of generality (for example, it is not 
necessary to know the types and/or rates of all possible 
events in advance) and flexibility since the cycle length 
T can be dynamically tunecU& to optimize the parallel 
efficiency. 

Recently, we have studied the parallel efficiency and 
applicability of the SR algorithm in parallel KMC sim- 
ulations of epitaxial growth^ and have found that in 
some cases a reasonable parallel efficiency can be ob- 
tained. However, due to fluctuations (which increase 
logarithmically^ with the number of processors N p ) as 
well as the requirement of global communications at the 
end of each cycle (the global communications time also 
increases logarithmically with N p ) the computational 
speedup as a function of N p is sublincar for fixed pro- 
cessor size. In addition, implementing such an algorithm 
is relatively complex. Therefore, there is a need for a 
somewhat simpler and more efficient algorithm. 

In order to address these problems, we have developed 
a simpler synchronous sublattice (SL) algorithm for par- 
allel kinetic Monte Carlo which we describe in detail here. 
While the SL algorithm is not rigorous, we find that us- 
ing certain reasonable assumptions on the cycle length 
and processor size, the results obtained are identical to 
those obtained in serial simulations. Furthermore, be- 
cause the SL algorithm requires only local communica- 
tions, the parallel efficiency is essentially independent of 
the number of processors in the large N p limit, thus lead- 
ing to linear scaling. As a result, the parallel efficiency is 



in general significantly greater than for the synchronous 
relaxation algorithm. 

The organization of this paper is as follows. In Section 
II we describe the algorithm. In Section III we present 
results obtained using this algorithm for several different 
models of thin-film growth, including a comparison with 
serial results. We also study the effects of fluctuations 
on the parallel efficiency and present results for the mea- 
sured and theoretical parallel efficiency as a function of 
processor size and number of processors. The effects of 
finite processor size on the accuracy of the results ob- 
tained using our algorithm are also discussed and com- 
pared with finite-size effects due to finite system-size. Fi- 
nally, in Section IV we summarize our results and discuss 
the general applicability of the SL algorithm to parallel 
kinetic Monte Carlo simulations. 



II. SYNCHRONOUS SUBLATTICE 
ALGORITHM 

As in previous work on the "conservative" asyn- 
chronous algorithm, §*2i in the synchronous sublattice (SL) 
algorithm, different parts of the system are assigned via 
spatial decomposition to different processors. However, 
in order to avoid conflicts between processors due to the 
synchronous nature of the algorithm, each processor's do- 
main is further divided into different regions or sublat- 
tices (see Fig. QJ. A complete synchronous cycle cor- 
responding to a time interval T is then as follows. At 
the beginning of a cycle each processor's local time is ini- 
tialized to zero. One of the sublattices is then randomly 
selected so that all processors operate on the same sub- 
lattice during that cycle. Each processor then simultane- 
ously and independently carries out KMC events in the 
selected sublattice until the time of the next event ex- 
ceeds the time interval T (see Fig. |2J). As in the usual 
serial KMC, each event is carried out with time incre- 
ment Ati = —ln{ri) j Ri where r, is a uniform random 
number between and I and Ri is the total KMC event 
rate. Each processor then communicates any necessary 
changes (boundary events) with its neighboring proces- 
sors, updates its event rates and moves on to the next 
cycle using a new randomly chosen sublattice. 

Fig. ^ shows two possible methods of spatial and sub- 
lattice decomposition which are appropriate for simula- 
tions of thin-film growth— a square sublattice decompo- 
sition (Fig. ^a)) and a strip sublattice decomposition 
(Fig. njb)). In the square sublattice decomposition, the 
system is divided into squares, each of which is assigned 
to a different processor, and each processor's domain is 
further divided into 4 square sublattices. At the begin- 
ning of each cycle one of the 4 sublattices (A,B,C, or D) 
is randomly chosen. In the strip-sublattice geometry, the 
system is divided into strips, each of which is assigned 
to a different processor, and each processor's domain is 
further divided into 2 strips or sublattices. At the be- 
ginning of each cycle one of the 2 sublattices (A or B) is 



randomly chosen. 

In order to avoid conflicts, the sublattice size must 
be larger than the range of interaction (typically only 
a few lattice units in simulations of thin-film growth). In 
addition, in order for each processor to calculate its event 
rates, the configuration in neighboring processors must 
be known as far as the range of interaction. As a result, 
in addition to containing the configuration information 
for its own domain, each processor's array also contains 
a "ghost-region" which includes the relevant information 
about the neighboring processor's configuration beyond 
the processor's boundary. 

At the end of each cycle, each processor exchanges 
information with its neighboring processors in order to 
properly update its corresponding boundary and ghost 
regions. For example, if sublattice A is selected in the 
case of square-sublattice decomposition, then at the end 
of a cycle, possible boundary events must be communi- 
cated to the three processors north, west and northwest 
of each processor. By using sequential north and west 
communications, one can eliminate the northwest com- 
munication, and so only two communications are needed 
at the end of each cycle. Similarly, if sublattice B is se- 
lected in the case of strip-sublattice decomposition, then 
at the end of a cycle, possible boundary events must be 
communicated to the processor to the east. 

Since moves are only allowed in the selected sublat- 
tice during a cycle, several cycles are needed for the en- 
tire system time to progress by T . Thus, in the square 
(strip) geometry, it takes on average 4 cycles (2 cycles) 
to increase the overall system time by T. During each 
cycle, the event rates in the non-selected sublattices of a 
given processor are automatically updated as each event 
proceeds, just as in the usual serial KMC. Sublattice se- 
lection can be carried out either by having one processor 
select the sublattice for that cycle and then distribute it 
to all processors, or more efficiently by seeding all pro- 
cessors with the same random number generator so that 
they all independently select the same sublattice for each 
cycle. 

We note that due to the reduced communication in the 
strip-sublattice decomposition compared to the square- 
sublattice decomposition, the strip-sublattice decompo- 
sition is more efficient. In addition, since the sublattice 
in the strip-geometry is twice as large as for the square- 
geometry for the same processor size N x N y , there will 
be twice as many events per cycle in the strip geometry 
thus further reducing the overhead due to communica- 
tion time. Thus, we expect that the overhead due to 
communication latency in the strip-geometry will be ap- 
proximately one-half of that for the square-geometry 

We now consider the validity and efficiency of the syn- 
chronous sublattice (SL) algorithm. If the time interval T 
is not too large, then the SL algorithm corresponds to al- 
lowing different sublattices to get slightly "out of synch" 
during each cycle. Over many cycles one expects such 
fluctuations to cancel out and so the parallel evolution 
should be identical to the corresponding serial KMC sim- 
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FIG. 1: Diagram showing (a) square sublattice decompo- 
sition (9 processors) and (b) strip sublattice decomposition 
(3 processors). Solid lines correspond to processor domains 
while dashed lines indicate sublattice decomposition. Dot- 
ted lines in (a) and (b) indicate "ghost- region" surrounding 
central processor. 
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FIG. 2: Diagram showing time-evolution in SL algorithm. 
Dashed lines correspond to selected events, while dashed line 
with an X corresponds to an event which is rejected since it 
exceeds the cycle time. 



ulation. Of course, in order to maximize the efficiency of 
the algorithm (i.e. the average number of events per pro- 
cessor per cycle) and minimize the communication time 
overhead, one would like to have the largest possible value 
of T which does not "corrupt" the time-evolution. As we 
shall demonstrate below, by picking the cycle length T 
less than or equal to the average time for the fastest pos- 
sible activated event (e.g. monomer hopping in the sim- 
plest possible model of thin-film growth) we do indeed 
obtain (except for very small processor sizes for which 
finite-size effects may occur) results which are identical 
to those obtained in serial KMC except for very small 
sublattice sizes. Thus, by using the general rule that 



the time interval T must be smaller than or equal to 
the inverse of the fastest possible event rate in the KMC 
table, we expect that the synchronous algorithm will pro- 
vide accurate results for sufficiently large sublattices. We 
note that the synchronous sublattice algorithm can also 
be used in a "self-learning" KMG±2 in which the KMC 
rate-tables are updated as the simulation goes along. In 
this case, if a new "fastest-event-rate" is discovered in 
the middle of a cycle, then one merely restarts the cycle 
from the beginning using a smaller cycle time T. 
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FIG. 3: Schematic diagram of island-relaxation mechanisms 
for (a) edge-and-corner and (b) reversible models. 



III. RESULTS 

In order to test the performance and accuracy of our 
synchronous sublattice algorithm we have used it to sim- 
ulate three specific models of thin- film growth. In partic- 
ular, we have studied three solid-on-solid (SOS) growth 
models on a square lattice: a "fractal" growth model, an 
edge-and-corner diffusion (EC) model, and a reversible 
model with one-bond detachment ("reversible model"). 
In each of these three models the lattice configuration 
is represented by a two-dimensional array of heights and 
periodic boundary conditions are assumed. In the "frac- 
tal" model, 20 atoms (monomers) are deposited onto a 
square lattice with (per site) deposition rate F, diffuse 
(hop) to nearest-neighbor sites with hopping rate D and 
attach irreversibly to other monomers or clusters via a 
nearest- neighbor bond (critical island size of 1). The 
key parameter is the ratio D/F which is typically much 
larger than one in epitaxial growth. In this model frac- 
tal islands are formed in the submonolayer regime due 
to the absence of island relaxation. The EC model is 
the same as the fractal model except that island relax- 
ation is allowed, i.e. atoms which have formed a single 
nearest-neighbor bond with an island may diffuse along 
the edge of the island with diffusion rate D e — r e D and 
around island-corners with rate D c = r c D (see Fig. [3J. 
Finally, the reversible model is also similar to the frac- 
tal model except that atoms with one-bond (edge-atoms) 
may hop along the step-edge or away from the step with 
rate D\ — r±D, thus allowing both edge-diffusion and 
single-bond detachment. For atoms hopping up or down 
a step, an extra Ehrlich-Schwoebel barrier to interlayer 
diffusion^! may also be included. In this model, the crit- 
ical island size *2£ can vary from i = 1 for small values of 
r\ , to i = 3 for sufficiently large values of D/F and r± ~ 

For the fractal and reversible models, the range of in- 
teraction corresponds to one nearest-neighbor (lattice) 
spacing, while for the EC model it corresponds to the 
next-nearest-neighbor distance. Thus, for these mod- 
els the width of the "ghost-region" corresponds to one 
lattice-spacing. We note that at each step of the simu- 
lation, either a particle is deposited within the appropri- 
ate sublattice, or a particle diffuses to a nearest-neighbor 
or next-nearest-neighbor lattice site. In order to avoid 
"double-counting" , only particles within a processor's do- 
main may diffuse, e.g. if a particle diffuses from the 



boundary region of a processor into its ghost-region dur- 
ing a cycle, then that particle is no longer free to move 
during that cycle. In more general models, for which con- 
certed moves involving several atoms may occur ^25*22*21 
the ghost region needs to be at least as large as the range 
of interaction and/or the largest possible concerted move. 
In such a case, the processor and sublattice to which a 
concerted event belongs can be determined by consider- 
ing the location of the center-of-mass of the atoms in- 
volved in the concerted move. 

In order to maximize both the serial and parallel ef- 
ficiency in our KMC simulations, we have used lists to 
keep track of all possible events of each type and rate. 
For each sublattice, a set of lists is maintained which 
contains all possible moves of each type. A binary tree 
is used to select which type of move will be carried out, 
while the particular move is then randomly chosen from 
the list of the selected type. After each move, the lists 
are updated. 



A. Computational Details 

In order to test our algorithm we have carried out both 
"serial emulations" as well as parallel simulations. How- 
ever, since our main goal is to test the performance and 
scaling behavior on parallel machines we have primar- 
ily focused on direct parallel simulations using the Ita- 
nium and AMD clusters at the Ohio Supercomputer Cen- 
ter (OSC) as well as on the Alpha cluster at the Pitts- 
burgh Supercomputer Center (PSC). All of these clus- 
ters have fast communications— the Itanium and AMD 
clusters have Myrinet and the AlphaServer cluster has 
Quadrics. In our simulations, the interprocessor commu- 
nications were carried out using MPI (Message-Passing 
Interface) . 



B. Comparison with Serial Results 

As a test of our algorithm we first present some de- 
tailed comparisons with serial results for different num- 
bers of processors and system sizes for both the square 
and the strip geometries. Fig. 0] shows a comparison 
of parallel and serial results for the fractal model with 
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FIG. 4: Comparison between serial and parallel results us- 
ing synchronous sublattice algorithm with strip decomposi- 
tion (L = N V ) for fractal model with D/F = 10 5 . 
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synchronous sublattice algorithm with strip decomposition for 
EC model with D/F = 10 5 , L = 256, and D e = 0.1D, D c = 0. 



D/F = 10 5 and a square system of size L = 256. The 
parallel simulations were carried out using a strip sublat- 
tice decomposition with processor sizes N x = 16, 32 and 
64 with N y — 256 corresponding to N p — 16, 8, and 4 
respectively, where N p is the number of processors. In 
particular, Fig. 0Ja) shows the substrate monomer den- 
sity Ni and island density N (averaged over 500 runs) 
as a function of coverage in the first half-layer of growth, 
while Fig. 01b) shows the r.m.s. surface height fluc- 
tuations or surface width (averaged over 100 runs) as a 
function of coverage in the first few layers of growth. The 
inset of Fig. E[b) also shows the monomer density as a 
function of coverage in the first 5 layers of growth. As 
can be seen, there is no difference within statistical er- 
ror between the serial and the parallel results. A similar 
comparison is shown in Fig. [SJfor the edge-diffusion (EC) 



model (D/F = 10 5 



0.1, r c — 0) using a strip sub- 



lattice decomposition. As can be seen there is again no 
difference between the parallel and serial results. 



C. Parallel Efficiency as a Function of D/F 

We now consider the performance of the synchronous 
sublattice algorithm, starting with the dependence of the 



parallel efficiency on the monomer diffusion rate D/F 
for the fractal model for a fixed number of processors 
(N p = 4). Here we define the parallel efficiency as equal 
to the ratio of the execution time for an ordinary se- 
rial simulation of one processor's domain (using the same 
sublattice decomposition as in the parallel simulation) to 
the parallel execution time of N p domains using N p pro- 
cessors. Thus, the overall "performance factor" of the 
parallel simulation (e.g. boost in events/sec over a serial 
simulation) is given by the parallel efficiency multiplied 
by the number of processors N p . 

There are two primary factors which determine the 
parallel efficiency. The first is the overhead due to com- 
munications at the end of every cycle, when all proces- 
sors exchange boundary information with their neigh- 
bors. Since in our simulations the number of boundary 
events is relatively small (i.e. the processor size is not too 
large) the primary cause of communications overhead is 
the latency time for local communications which is in- 
dependent of processor domain size. The second impor- 
tant factor controlling the efficiency is the existence of 
fluctuations in the number of events in different proces- 
sors. In particular, in any given cycle one processor may 
have many events, while its nearest-neighbor may have 
fewer events. As a result, while the processor with many 



events is calculating its events, its neighboring proces- 
sor with few events must idle (wait) until it has received 
the boundary information from the first processor before 
moving to the next cycle. 

To illustrate this effect more quantitatively, we con- 
sider the effects of fluctuations on the parallel efficiency 
in the case of the one-dimensional strip sublattice de- 
composition shown in Fig. ^(b). In this case there are 
two sublattices (A and B) and during each cycle one of 
the sublattices is randomly selected. For example, if the 
B sublattice is selected, then at the end of a cycle all 
processors will do a (non-blocking) send of any bound- 
ary events in the B sublattice to the processor on their 
right, followed by a (blocking) receive of the correspond- 
ing boundary information from the processor on their 
left. Thus, for example, if processor 1 has more events 
than processor 2, and so takes longer to execute these 
events before initiating its send to processor 2, then pro- 
cessor 2 must wait before moving to the next cycle, thus 
leading to inefficiency. However, processor 2's execution 
is not affected by processor 3 during the same cycle, since 
its send to processor 3 is non-blocking. 

Denoting the communication overhead per cycle as 
tcom and taking into account the fluctuations of events 
between nearest-neighbor processors, we obtain the fol- 
lowing expression for the average time per cycle: 

tav(Np) = tip + tcom + (A(t)) (tlp/Uav) (1) 

where t\ p is the time for a single processor serial simu- 
lation of a single processor's domain, n av is the average 
number of events per processor per cycle, A(r) is the 
relevant fluctuation in a given cycle r (averaged over all 
processors), and the brackets denote an average over all 
cycles. The ratio t\ p /n av in the last term of Eq. I cor- 
responds to the average calculation time to process an 
event. Therefore, the parallel efficiency PE may be writ- 
ten as 
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In the limit of negligible communication time t com jt\ v — > 
0, this implies that the maximum possible parallel effi- 
ciency is given by, 
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There are two distinct ways in which the average fluc- 
tuation A(t) might be calculated. If we assume that at 
the beginning of each cycle all processors are perfectly 
synchronized, then for the strip geometry one may write, 



1 n p 
As(r) = -^^(n i+l5 ( T )(T)-n;(T)) e ( n i+s(T)(r)-ni(T)) 

P i— 1 

(5) 
where n^r) is the number of events in processor i in 
cycle r, Q(x) = 0(1) if a; is negative (positive) and 5(t) = 
+1(— 1) if the A (B) sublattice is selected in cycle r. Since 
we are interested in the average over many cycles, this is 
equivalent to the simpler form, 



As(r) = — J2 Mr) - th+i(t)| 



(6) 



where the factor of 1/2 is due to the fact that only half 
the time will the relative fluctuation in the relevant neigh- 
boring processor be positive, and thus lead to a delay. 

However, due to fluctuations one must also take into 
account the existence of desynchronization at the begin- 
ning of a cycle. In order to take this into account, we may 
calculate the sum or "starting time" Si (r) corresponding 
to the sum of the total number of events in processor i 
and the sum of all delay-events due to neighboring pro- 
cessors in a given processor i at the start of cycle r. At 
the start of the first cycle (r = 1) one has Si(l) = 
for all processors i and n,(l) is the number of events in 
processor i in that cycle. At the start of each subsequent 
cycle, the sum Si(r) may be calculated in each processor 
in terms of the previous values of Si(r — 1) and riiij — 1) 
as follows, 

Si(r) = Si(r - 1) + n % {r - 1) + A l (r)9(A l (r)) (7) 

where 

Aj(r) = S , i+< 5 (T) (r-l)+ra. 1+(5(r) (r-l)-S , i (T-l)-n i (r-l) 

(8) 
and where <5(t) = +1(— 1) if the A (B) sublattice is se- 
lected in cycle t. Then the average delay A(t) due to 
fluctuations in a given cycle t may be written, 



(3) 



N p 



(9) 



We also note that n av ~ N x N y and since the fluctua- 
tions are on average uncorrelated, one expects (A(r)) ~ 
y/riav This implies that the maximum possible parallel 
efficiency may be written as, 



pjjjmax _ r j 



(N x N y y/* 



(4) 



where the constant a is model-dependent. This result 
shows clearly that the maximum theoretical efficiency ap- 
proaches 1 in the limit of large n av corresponding to large 



N X ,N V . 



Fig. shows the measured fluctuations (A(r))/n al) 
and (Ag(r))/n n „ for the simple fractal model as a func- 
tion of D/F for fixed processor size N x = 256, N y = 1024 
and N p = 4. As can be seen, for N p = 4, the full fluc- 
tuation (A(r)) is approximately 30% larger than that 
obtained assuming that all processors are synchronized 
at the beginning of each cycle. For the simple fractal 
model, one expects n av ~ ATj ~ (D/F) -2 / 3 which im- 
plies (A(r))/n a „ ~ {D/F) 1 / 3 . As can be seen in Fig. 
there is very good agreement with this form for the 
D/F-dependence. 
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FIG. 6: Fluctuations as function of D/F for fractal model 
with N p = 4 and strip geometry with N x = 256, N y — 1024. 
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Fig. [7| shows the corresponding results (symbols) for 
the parallel efficiency as a function of the ratio D/F. Re- 
sults are shown for parallel KMC simulations with N p — 
4 of a square system with system size L = 1024 with both 
square sublattice decomposition (N x = N y = 512) and 
strip sublattice decomposition (N x = 256, N y = 1024). 
Due to the decreased communication overhead in the 
strip-geometry (1 send/receive versus 2 send/receives) 
the parallel efficiency of the strip geometry simulations 
is significantly larger than for the square geometry. As 
can be seen, for D/F < 10 6 , the parallel efficiency for 
the strip geometry is greater than 50%. However, with 
increasing D/F the parallel efficiency decreases signifi- 
cantly since the decrease in the number of events per 
cycle n av (see Fig. 0(a)) leads to an increase in the com- 
munications overhead t corn /t\ v as well as in the relative 
fluctuations (A(r))/n alI . 

Also shown in Fig. [7| (dashed lines) is the paral- 
lel efficiency calculated using Eq. 2 based on the mea- 
sured values of (A(r))/ri au , ti p /n av , and the measured 
interprocessor communication time t com ~ 15/xs per 
send/receive. As can be seen, there is good agree- 
ment between the measured and calculated results. The 
maximum theoretical parallel efficiencies calculated us- 
ing Eq. 3 assuming negligible communication overhead 
are also shown (solid lines). As can be seen, the maxi- 
mum theoretical efficiencies are significantly higher than 
the measured efficiencies for large D/F, although they 
also decrease with increasing D/F due to the increase 
in fluctuations. For the simple fractal model with strip- 
geometry, our results for the maximum possible parallel 



FIG. 7: (a) Events per cycle and (b) parallel efficiency for 
fractal model with N p = 4 as function of D/F. Dashed lines 
correspond to Eq. [2]while solid lines correspond to maximum 
theoretical efficiency given by Eq. |S| 



efficiency may be well described by the expression, 

PEf r a a X c = [ 1 + 0.21(C/F) 1 / 3 /(^A f y ) 1/2 P 1 (10) 

This result may be used to estimate the maximum possi- 
ble efficiency for the fractal model for different processor 
sizes and values of D/F. For example, if N x — N y = 64 
and D/F = 10 5 , then this implies a maximum possible 
parallel efficiency given by PE max = 0.40. This result 
shows that even in the absence of delays due to interpro- 
cessor communication, due to the existence of fluctua- 
tions, the parallel efficiency will decrease with increasing 
D/F. 

Fig. [S] shows similar results for the parallel efficiency 
as a function of D/F for the edge-diffusion model with 



D e = 0.1D,D C = 0, N p = 4 and 9 = XML. As can 
be seen, although the parallel efficiency for the edge- 
diffusion model still decreases with increasing D/F, it is 
significantly larger than for the fractal model. In partic- 
ular, due to the increased number of events per cycle and 
the resulting reduced communication overhead, the par- 
allel efficiency remains above 50% for large D/F. As an 
example, for the case of strip-geometry and D/F = 10 7 , 
the parallel efficiency is more than 3 times that for the 
simple fractal model, while the number of events is ap- 
proximately 10 times larger. As for the fractal model, 
the calculated parallel efficiency (dashed lines) is in good 
agreement with the measured values. Due to the increase 
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FIG. 8: (a) Events per cycle and (b) parallel efficiency for 
edge-diffusion model with N p — 4 as function of D/F. Dashed 
lines correspond to Eq. |5|while solid lines correspond to max- 
imum theoretical efficiency given by Eq. [3] 



in the number of events per cycle, and the resulting de- 
crease in the relative fluctuations, the maximum theoret- 
ical efficiencies (solid lines) are also significantly higher 
than for the fractal case. 



D. Parallel efficiency as function of number of 
processors for fixed processor size 

Fig. Elshows the performance (events/sec) for the sim- 
ple fractal model with D/F = 10 5 as a function of the 
number of processors N p with fixed processor size, us- 
ing both square decomposition with N = 512 and strip 
decomposition with N x = 256 and N y = 1024. As can 
be seen in both cases there is a roughly linear speedup 
of the performance with increasing number of processors 
N p . For comparison, the equivalent single-processor (se- 
rial) computation speed for this model is approximately 
2.8 x 10 5 events/sec. However, due to the decreased com- 
munication cost, the speed-up using the strip geometry 
is significantly higher than for the square decomposition. 

We now consider the dependence of the parallel effi- 
ciency on the number of processors N p in more detail for 
the case of strip geometry. Fig. [J3 shows the measured 
fluctuations (A(r))/n at , and (As(r)}/n av as a function of 
N p for the fractal model for D/F — 10 5 . With increasing 
N p , the relative event-fluctuation (As(r))/n a „ remains 
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FIG. 9: Total computational speed (events/sec) as function 
of number of processors for fractal model with D/F = 10 5 . 
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FIG. 10: Fluctuations for fractal model with D/F = 10 5 
as function of N v . Solid line corresponds to fit of form 

{As(T))/n av = 0.3 - 0.28/7V p - 68 



constant. In contrast, the full fluctuation (A(r))/n a „ 
increases slowly with increasing N p but appears to satu- 
rate at a finite value for large N p . This is supported by 
the fit shown in Fig. QJ]| (solid line) which agrees quite 
well with the simulation results and which has the form, 
(A s (r))/n a „ = 0.30 - 0.28/7V° 68 . Due to the saturation 
of fluctuations, we expect that the parallel efficiency will 
also saturate for large N p . 
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FIG. 11: Parallel efficiency (symbols) as function of number 
of processors N p for fractal and edge-diffusion models with 
D/F = 10 5 and strip-sublattice decomposition (N x = 256 
with N y — 256 and 1024). Dashed lines correspond to Eq. |5] 
while solid lines correspond to maximum theoretical efficiency 
calculated using Eq. |3] 



Fig. II II shows our results for the measured parallel ef- 
ficiency (open and closed symbols) as a function of N p 
for fixed processor size for the fractal and edge-diffusion 
models. As expected, the parallel efficiency is essentially 
constant for large N p , although there is a slight decrease 
due to increased communication overhead for N p > 100. 
Also shown (dashed lines) are the parallel efficiencies cal- 
culated from Eq. fusing the measured fluctuations and 
communication times, as well as the maximum possible 
theoretical parallel efficiencies (solid lines) calculated us- 
ing Eq. [3J As can be seen, there is relatively good agree- 
ment between the calculated and measured parallel effi- 



We note that the results for large N p (open symbols) 
were obtained using the Alpha cluster at the Pittsburgh 
Supercomputer Center (PSC) for which the communica- 
tion latency is somewhat larger than for the OSC clus- 
ter. As a result the parallel efficiencies are somewhat 
lower than would be obtained with the OSC cluster. 
For comparison, OSC results for the fractal model with 
N x = N v = 256 and with N x = 256, N y = 1024 are 
also shown up to N p — 64 (filled symbols). As can be 
seen, due to the decreased communication time, the OSC 
results for the parallel efficiency are significantly larger 
than the corresponding PSC results. Except for the PSC 



E. Finite-Size Effects 

We now consider the effects of finite processor size 
on the accuracy of the results obtained using the syn- 
chronous sublattice algorithm. For simplicity we focus on 
the case of strip-sublattice decomposition. As we have al- 
ready shown (see Figs. 0]and|SJl, for sublattice sizes which 
are not too small, there is perfect agreement between the 
synchronous sublattice results and the corresponding se- 
rial results. However, for very small processor sizes there 
exists a small "finite-size" effect which leads to results 
which arc slightly different from those obtained using the 
usual serial KMC algorithm. In particular, as shown in 
Fig. ^1 there is essentially perfect agreement between 
the synchronous sublattice results for the fractal model 
with system size L = 256, D/F = 10 5 , and N x = 16-256 
and the corresponding serial results. However, for the 
smallest processor size (N x = 8) there is approximately 
a 2% difference between the synchronous sublattice re- 
sults for the peak island density N and the correspond- 
ing serial results (although there are no differences in the 
monomer density N\). In order to compare these effects 
with those of finite system size, in Fig. El( c ) an d (d) we 
also show the corresponding serial results for systems of 
size N x — 8 — 256 and N y = 256. As can be seen, the 
finite-size effects which occur for small system size are 
much larger than those due to finite processor size. This 
indicates that the effects of finite processor size are both 
qualitatively different as well as much weaker than those 
due to finite-system size. 

We now consider these finite-size effects in somewhat 
more detail. While a variety of length scales (such as the 
typical mound or feature size in multilayer growth) may 
occur in the models studied here, there is one dynami- 
cal length scale corresponding to the "diffusion length" 
Id (e.g. the typical distance a monomer may diffuse be- 
fore being captured) which plays a particularly important 
role. In particular, the diffusion length may be written 
in terms of the peak submonolayer island density, i.e. 
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256, the parallel efficien- 



cies are all larger than 50%. 



£> ~ iv k ■ . Since in the synchronous sublattice algo- 
rithm, particles which diffuse outside the active sublat- 
tice are no longer active during that cycle, we conjecture 
that for sublattice sizes (N x /2) which are smaller than 
the diffusion length Id, finite-size effects may occur. For 
the fractal model studied here, one has N p k ~ (D/F) -1 / 3 
which implies Id ~ (D/F) 1 / 6 , e.g. the diffusion length 
increases slowly with increasing D/F. As shown in Fig. 
IT31 by measuring the peak island density for D/F = 10 5 , 
we obtain Id — 11 which implies a critical processor size 
N x given by N x ~ 2 Id — 22. This result is in good 
agreement with the observation of the onset of signifi- 
cant finite-size effects for N x < 16. 

Also shown in Fig. ^|are similar results for the edge- 
diffusion model with D/F = 10 5 and r e = 0.1. In this 
case the diffusion length is slightly higher than for the 
fractal model. However, again there are no finite-size 
effects for N x > 21 d- Similar results have also been ob- 
tained for the reversible one-bond detachment model, as 
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FIG. 12: Finite-size effects in parallel and serial simulations 
of fractal model with D/F — 10 5 . Parallel simulations (aver- 
aged over 200 runs) are for system size L = 256 with processor 
sizes N x = 8, 16, 32, 64, and 128 and N y = 256. Serial simu- 
lations (averaged over 500 runs) are for system size N x x 256. 
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well as a reversible bond-counting model (not shown) . In 
all cases, we find that there are no differences between 
the serial results and the parallel KMC results as long 
as N x > 21 n- In contrast, for N x < 21 d, noticeable but 
weak finite-size effects are observed. 

While these results are for a cycle length T = l/D 
given by the inverse of the fastest hopping rate, for a 
smaller cycle-length we expect that the critical processor 
size N x corresponding to finite-size effects will be signif- 
icantly reduced. As shown in Fig. ^| (filled symbols) 
for the fractal model with cycle length T = 1/(6D), the 
critical processor size N x is significantly smaller than the 
diffusion length Id- However, for such a reduced cycle 
length, the parallel efficiency is also significantly reduced. 

As a further test of both the parallel efficiency and 
finite-size effects in the SL algorithm, we have carried out 
multilayer simulations of the reversible model at T = 300 
K, with system size L = 1024, D/F = 10 5 , E x = 0.1 
eV, and an Ehrlich-Schwoebel barrier to interlayer dif- 
fusion given by Eb — 0.07 eV. In our simulations par- 
ticles freshly deposited near step-edges are assumed to 
"cascade" randomly to the nearest-neighbor sites below 
( "knockout" ) . Fig ^] (a) shows serial results (solid line) 
for the r.m.s. surface height fluctuations (surface width) 
and monomer density up to 500 ML along with the cor- 
responding parallel results obtained using the SL algo- 
rithm with processor sizes N y — 1024 and N x = 64, 128, 
and 256 corresponding to N p = 16, 8, and 4 respectively. 
As can be seen, there is no difference between the serial 
and parallel results even though the typical mound size of 
approximately 100 lattice units (see Fig. UlT bll is signifi- 
cantly larger than the smallest sublattice size N x /2 = 32. 
This indicates that the relevant length-scale determining 
the existence of finite-size effects in the SL algorithm is 
indeed the diffusion length Id and not the characteristic 
feature size. Since in these simulations the total system 
size L was fixed, the parallel efficiency may be written, 



PE = 



lp 



N p t av {N p ) 



(11) 



where t\ p is the calculation time for a serial simulation of 
the entire system. Since the processor size decreases with 
increasing N p , both the relative magnitude of event fluc- 
tuations and the overhead due to communication latency 
will also increase. As a result, the parallel efficiency de- 
creases with increasing N p rather than saturating as in 
the case of fixed processor-size. The parallel efficiencies 
obtained in these simulations were 92% (N p = 4), 85% 
(N p = 8), and 70% (N p = 16) respectively. 
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FIG. 13: Diffusion length In — N . in parallel and se- 
rial simulations of fractal model (200 runs) and edge-diffusion 
model (100 runs) with D/F = 10 5 and T = l/D (open sym- 
bols). All simulations are for system size L = 256 with pro- 
cessor sizes A?* = 8, 16, 32, 64, 128, and 256 and N y = 256. 
Horizontal dashed lines correspond to error bars for serial 
simulations. 



IV. DISCUSSION 

We have developed and tested a synchronous sublattice 
(SL) algorithm for parallel kinetic Monte Carlo simula- 
tions. In our algorithm, the maximum cycle length T is 
given by the inverse of the fastest diffusion rate. For sub- 
lattice sizes which are smaller than the diffusion length 
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FIG. 14: (a) Comparison between serial and parallel results 
for reversible model with L = N y = 1024, D/F = 10 5 , Ej = 
0.1 eV and E b = 0.07eV (N p = L/N x ). (b) Gray-scale plot 
of 512 x 512 portion of system at = 500 ML. 



Id, weak finite-size effects are observed which lead to 
deviations from the results obtained using a serial algo- 
rithm. However, for sublattice sizes larger than the diffu- 
sion length Id , the results obtained are identical to those 
obtained in serial simulations. Since in many systems of 
interest the diffusion length is typically relatively small 
(e.g. of the order of a few to a few tens of lattice spac- 
ings) while significantly larger system sizes are needed to 
avoid finite system-size effects, the SL algorithm should 
provide a useful, efficient, and accurate method to carry 
out parallel KMC simulations of these systems. 

We have also measured the parallel efficiency of the 
SL algorithm as a function of the number of processors 



N p for fixed processor size. Because the SL algorithm 
is synchronous, the parallel efficiency is affected by fluc- 
tuations in the number of events in different processors 
over a given cycle. However, because only local commu- 
nications are required, these fluctuations saturate as the 
number of processors increases. As a result, linear scal- 
ing behavior for the total speedup as a function of the 
number of processors is observed, e.g. the parallel effi- 
ciency is independent of the number of processors in the 
large N p limit. 

For the simple models we have studied here, the cal- 
culation time for a single event such as diffusion or de- 
position is significantly smaller than the latency time for 
nearest-neighbor communication. As a result, the par- 
allel efficiency increases with processor size, since the 
communications overhead per event is reduced by the 
increased number of events in a cycle. However, even for 
relatively modest processor sizes, we have obtained rea- 
sonable values for the asymptotic parallel efficiency PE 
ranging from 50% for the fractal model with D/F = 10 s 
and N x = N y = 256, to 70% for the fractal model with 
N x = 256, N y = 1024. For the slightly more compli- 
cated edge-diffusion (EC) model, for which the number 
of events per cycle is larger, significantly larger efficien- 
cies are obtained for the same processor size, e.g. 60% 
for N x = N y = 256 and D/F = 10 5 . Similarly, for 
the reversible model, we have obtained a parallel effi- 
ciency PE ~ 70% for the same effective processor size 
(N x = 64, N y = 1024) with N p = 16. 

We have also studied the effects of fluctuations on the 
parallel efficiency in our simulations. In particular, we 
found that the relevant relative fluctuations (As(r))/n a „ 
scale as one over the square root of the processor size (see 
Eg 1 101 and FigJBJ. By taking into account the effects of 
fluctuations and communication delays, calculated paral- 
lel efficiencies were obtained which are in excellent agree- 
ment with those obtained in our simulations. In addition, 
by measuring the relevant fluctuations, we have been able 
to predict the maximum possible theoretical efficiencies 
in the absence of communication delays. For example, 
for the fractal and edge-diffusion models with D/F = 10 5 



and N T 



N, 



256, maximum theoretical parallel effi- 



ciencies of 80% and 90% respectively were obtained. 

Since increasing the processor size decreases the ef- 
fects of fluctuations as well as communications overhead, 
the parallel efficiency may be further increased by in- 
creasing the processor size. Alternatively, in simulations 
on machines with faster communications (such as shared 
memory machines) or in simulations of more complicated 
KMC models for which the calculation time is signifi- 
cantly larger (such as self-learning KMG-*i or accelerated 
dynamics 27 ) efficiencies approaching these values may be 
possible even without increasing the processor size. 

It is worth noting that in our simulations we have used 
two slightly different definitions for the parallel efficiency. 
In the first definition (Eq. |2J), the parallel execution time 
was compared with the serial execution time of a system 
whose size is the same as a single processor. In contrast, 
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in the second definition (Eq. Illjl the parallel execution 
time was directly compared with 1/N p times the serial 
execution time of a system whose total system size is the 
same as in the parallel simulation. If the serial KMC 
calculation time per event is independent of system size, 
then there should be no difference between the two def- 
initions. However, in general this may not be the case. 
For example, in KMC simulations in which the rates for 
all events are stored separately, the calculation time per 
event will increase as M 1 / 2 (where M = L x L y is the sys- 
tem size) using the Maksym algorithm^ and as log(M) 
using a binary tree algorithm^ In this case, the parallel 
efficiency calculated using Eq. ^] may be significantly 
higher than that obtained using Eq. [21 and may even be 
larger than 1, since the division of a system into smaller 
subsystems may reduce the calculation time per event 
per processor. 

Since in the models studied here we have used lists for 
each type of event, rather than the "partial-sum" algo- 
rithms described above, we would expect the serial cal- 
culation time per event to be independent of system size, 
and thus the two definitions of parallel efficiency should 
be equivalent. To test if this is the case, we have calcu- 
lated the serial simulation time per event for the fractal 
model for D/F = 10 3 and D/F = 10 5 for a variety of 
system sizes ranging from L = 64 to L = 2048. Some- 
what surprisingly, we found that the serial calculation 
time per event increases slowly with increasing processor 
size. In particular, an increase of approximately 50% in 
the calculation time per event was obtained when going 
from a system of size L = 64 to L = 2048. We believe 
that this is most likely due to memory or "cache" effects 
in our simulations. This increase in the serial calculation 



time per event with increasing system size indicates that 
the calculated parallel efficiencies shown in Fig. II II would 
actually be somewhat larger if the more direct definition 
of parallel efficiency (Eq. Illfl were used. However, since 
for large N p this requires serial simulations of very large 
systems (e.g. 256,000 x 256,000 for N p = 100), the first 
definition (Eq. [2J) was used. 

Finally, we return to the general question of the appli- 
cability and validity of the SL algorithm. In general, we 
expect that for a wide class of non-equilibrium processes 
there exists a clearly defined diffusion length lo, which 
may or may not vary slowly with time. For these pro- 
cesses we expect that finite-size effects will not occur as 
long as the sublattice size is larger than Id. Furthermore, 
as long as this length scale is not too large compared to 
the desired system size, parallel simulations using the 
SL algorithm will be advantageous. As our results in- 
dicate, parallel KMC simulations using the synchronous 
sublattice algorithm are in general likely to be signifi- 
cantly faster than either the conservative asynchronous 
algorithm or the synchronous relaxation algorithm. As 
a result, we expect that the synchronous sublattice algo- 
rithm may be particularly useful as a means to carry out 
a variety of parallel non-equilibrium simulations. 
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